Import Libraries
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Get the dataset
# Grab the data
df = pd.read_csv('creditcard.csv')
# preview the first rows of the data
df.head()
Checking Missing Data
# null checking
df.info()
# summary statistics of the data
df.describe()
Looking at the Time feature, we can confirm that the data contains 284,807 transactions collected over 2 consecutive days (172,792 seconds).
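As a quick sanity check (a sketch, not part of the original analysis), two full days expressed in seconds come out just above the observed maximum:

```python
# Two full days expressed in seconds
seconds_in_two_days = 2 * 24 * 3600
print(seconds_in_two_days)  # 172800, just above the observed maximum of 172792
```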
Fraud/Genuine Ratio: Checking Data Imbalance
# Data Imbalance Checking
count_classes = df['Class'].value_counts(sort=True)
count_classes.plot(kind = 'bar', rot=0)
plt.title("Transaction Class Distribution")
plt.xticks(range(2))
plt.xlabel("Class")
plt.ylabel("Frequency")
df['Class'].value_counts()
genuine = (df['Class'] == 0).sum()
fraud = (df['Class'] == 1).sum()
print('Fraud/Genuine ratio is', round(fraud * 100 / genuine, 3), '%')
# The data is highly imbalanced
Data Exploration
# Transaction in Time
import plotly.figure_factory as ff
from plotly.offline import iplot
class_0 = df.loc[df['Class'] == 0]['Time']
class_1 = df.loc[df['Class'] == 1]['Time']
class_0.head() #'Time' where 'Class' is 0
hist_data = [class_0, class_1]
group_labels = ['Genuine', 'Fraud']
fig = ff.create_distplot(hist_data, group_labels, show_hist=False, show_rug=False)
fig['layout'].update(title='Credit Card Transactions Time Density Plot', xaxis=dict(title='Time [s]'))
iplot(fig, filename='dist_only')
Fraudulent transactions are distributed more evenly over time than genuine ones: they occur even during the low-activity periods (night in the European timezone) when genuine transactions drop off.
# Transaction in Amount
fig, ax1 = plt.subplots(figsize=(6,6))
s = sns.violinplot(ax=ax1, x="Class", y="Amount", hue="Class", data=df, palette="PRGn")
plt.show()
class_0 = df[df['Class'] == 0]['Amount'] # 'Amount' while 'Class' is 0
class_1 = df[df['Class'] == 1]['Amount'] # 'Amount' while 'Class' is 1
class_0.describe()
class_1.describe()
Genuine transactions have a lower mean amount, a larger Q1, a smaller Q3, and larger outliers; fraudulent transactions have a smaller Q1, smaller outliers, and a larger mean and upper quartile.
# let's check fraudulent transactions amount and genuine transaction amount by time
fraud = df[df['Class'] == 1]
genuine = df[df['Class'] == 0]
# The transaction time range spans about 2 days
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,8))
f.suptitle('Time of transaction vs Amount by class')
ax1.scatter(fraud['Time'], fraud['Amount'])
ax1.set_title('Fraud')
ax2.scatter(genuine['Time'], genuine['Amount'])
ax2.set_title('Normal')
plt.xlabel('Time (in Seconds)')
plt.ylabel('Amount')
plt.show()
Fraudulent transaction amounts never exceed the largest genuine amounts, but they are still substantial.
Let's check at which amounts and times fraudulent transactions occur.
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,10))
f.suptitle('Amount per transaction by class')
bins = 50
ax1.hist(fraud.Amount, bins = bins)
ax1.set_title('Fraud')
ax2.hist(genuine.Amount, bins = bins)
ax2.set_title('Normal')
plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.xlim((0, 20000))
plt.yscale('log')
plt.show();
# Feature correlation checking
plt.figure(figsize=(14,14))
plt.title('Credit Card Transactions features correlation plot (Pearson)')
corr = df.corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, linewidths=.1, cmap="RdYlGn")
plt.show()
As expected, there is no notable correlation between features V1-V28. There are certain correlations between some of these features and Time (inverse correlation with V3) and Amount (direct correlation with V7 and V20, inverse correlation with V2 and V5).
Let's plot the correlated and inverse correlated values on the same graph.
Let's start with the direct correlated values: {V20;Amount} and {V7;Amount}.
s = sns.lmplot(x='V20', y='Amount',data=df, hue='Class', fit_reg=True,scatter_kws={'s':2})
s = sns.lmplot(x='V7', y='Amount',data=df, hue='Class', fit_reg=True,scatter_kws={'s':2})
plt.show()
We can confirm that the two feature pairs are correlated (the regression lines for Class = 0 have a positive slope, whilst the regression lines for Class = 1 have a smaller positive slope).
Let's plot now the inverse correlated values.
s = sns.lmplot(x='V2', y='Amount',data=df, hue='Class', fit_reg=True,scatter_kws={'s':2})
s = sns.lmplot(x='V5', y='Amount',data=df, hue='Class', fit_reg=True,scatter_kws={'s':2})
plt.show()
We can confirm that the two feature pairs are inversely correlated (the regression lines for Class = 0 have a negative slope, while the regression lines for Class = 1 have a very small negative slope).
Feature density plot
var = df.columns.values
i = 0
t0 = df.loc[df['Class'] == 0]
t1 = df.loc[df['Class'] == 1]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(8,4,figsize=(16,28))
for feature in var:
    i += 1
    plt.subplot(8,4,i)
    sns.kdeplot(t0[feature], bw_method=0.5, label="Class = 0")
    sns.kdeplot(t1[feature], bw_method=0.5, label="Class = 1")
    plt.xlabel(feature, fontsize=12)
    locs, labels = plt.xticks()
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show();
For some of the features we can observe a good selectivity in terms of distribution for the two values of Class: V4, V11 have clearly separated distributions for Class values 0 and 1, V12, V14, V18 are partially separated, V1, V2, V3, V10 have a quite distinct profile, whilst V25, V26, V28 have similar profiles for the two values of Class.
In general, with just a few exceptions (Time and Amount), the feature distributions for legitimate transactions (Class = 0) are centered around 0, sometimes with a long tail at one of the extremities. At the same time, the fraudulent transactions (Class = 1) have skewed (asymmetric) distributions.
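The skewness claim can be checked numerically with pandas. A small sketch on synthetic data (the `sym`/`skewed` series are illustrative, not drawn from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sym = pd.Series(rng.normal(size=5000))          # roughly symmetric, skewness near 0
skewed = pd.Series(rng.exponential(size=5000))  # right-skewed, skewness near 2
print(round(sym.skew(), 2), round(skewed.skew(), 2))
```

On the real data, the same check would be e.g. `df[df['Class'] == 1]['V1'].skew()`.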
Predictive models
# Define predictors and target values
target = 'Class'
predictors = ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',\
'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19',\
'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28',\
'Amount']
# Useful constants
RFC_METRIC = 'gini'  # split criterion for RandomForestClassifier
NUM_ESTIMATORS = 100 # number of estimators for RandomForestClassifier
NO_JOBS = 4          # number of parallel jobs for RandomForestClassifier
#TRAIN/VALIDATION/TEST SPLIT
#VALIDATION
VALID_SIZE = 0.20 # validation size for train_test_split
TEST_SIZE = 0.20  # test size for train_test_split
#CROSS-VALIDATION
NUMBER_KFOLDS = 5 #number of KFolds for cross-validation
RANDOM_STATE = 2018
MAX_ROUNDS = 1000 # lgb iterations
EARLY_STOP = 50 # lgb early stop
OPT_ROUNDS = 1000 # To be adjusted based on best validation rounds
VERBOSE_EVAL = 50 # Print out metric result
IS_LOCAL = False
# Split train data, test data, and validation set data
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=TEST_SIZE, random_state=RANDOM_STATE, shuffle=True )
train_df, valid_df = train_test_split(train_df, test_size=VALID_SIZE, random_state=RANDOM_STATE, shuffle=True )
Random Forest Classifier
Define model parameters
Let's set the parameters for the model, train it on the training set, and then validate it on the validation set.
We will use Gini as the validation criterion, where Gini = 2 * AUC - 1 and AUC is the area under the Receiver Operating Characteristic curve (ROC-AUC) [4]. The number of estimators is set to 100 and the number of parallel jobs to 4.
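The Gini/AUC relation is a one-liner; a minimal helper (`gini_from_auc` is a name of our choosing, not from any library):

```python
def gini_from_auc(auc: float) -> float:
    """Gini coefficient from ROC-AUC: Gini = 2 * AUC - 1."""
    return 2 * auc - 1

print(gini_from_auc(0.5))  # random classifier -> Gini 0.0
print(gini_from_auc(1.0))  # perfect classifier -> Gini 1.0
```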
We start by initializing the RandomForestClassifier.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_jobs=NO_JOBS,
                             random_state=RANDOM_STATE,
                             criterion=RFC_METRIC,
                             n_estimators=NUM_ESTIMATORS,
                             verbose=False)
# Let's train the RandomForestClassifier on the train_df data using the fit function
clf.fit(train_df[predictors], train_df[target].values)
# Let's now predict the target values for the valid_df data, using predict function.
preds = clf.predict(valid_df[predictors])
# Features importance
tmp = pd.DataFrame({'Feature': predictors, 'Feature importance': clf.feature_importances_})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (7,4))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
plt.show()
# The most important features are V17, V12, V14, V10, V11, V16
# Confusion matrix
cm = pd.crosstab(valid_df[target].values, preds, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm,
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True, ax=ax1,
            linewidths=.2, linecolor="Darkblue", cmap="Blues")
plt.title('Confusion Matrix', fontsize=14)
plt.show()
Type I error and Type II error
We should clarify that a confusion matrix is not a very good way to represent results on highly imbalanced data, because we actually need a metric that accounts for both sensitivity and specificity, so that we minimize Type I and Type II errors at the same time.
Null Hypothesis (H0) - The transaction is not a fraud. Alternative Hypothesis (H1) - The transaction is a fraud.
Type I error - You reject the null hypothesis when the null hypothesis is actually true. Type II error - You fail to reject the null hypothesis when the alternative hypothesis is true.
Cost of Type I error - You erroneously presume that the transaction is a fraud, and a genuine transaction is rejected. Cost of Type II error - You erroneously presume that the transaction is not a fraud, and a fraudulent transaction is accepted.
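In confusion matrix terms, Type I errors are the false positives and Type II errors the false negatives. A sketch on hypothetical labels (not the credit card data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = genuine, 1 = fraud
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('Type I errors (genuine flagged as fraud):', fp)
print('Type II errors (fraud accepted as genuine):', fn)
```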
# Let's calculate the ROC-AUC score
from sklearn.metrics import roc_auc_score, accuracy_score
roc_auc_score(valid_df[target].values, preds)
accuracy_score(valid_df[target].values, preds)
The ROC-AUC score obtained with RandomForestClassifier is 0.85.
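Note that this AUC is computed on hard `predict()` labels; scoring with `predict_proba` probabilities is usually more informative for ranking metrics like AUC. A sketch on synthetic imbalanced data (not the credit card dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (~5% positives)
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
auc_hard = roc_auc_score(y_te, clf.predict(X_te))              # AUC from 0/1 labels
auc_prob = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # AUC from scores
print(auc_hard, auc_prob)
```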
AdaBoostClassifier
AdaBoostClassifier stands for Adaptive Boosting classifier.
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(random_state=RANDOM_STATE,
                         algorithm='SAMME.R',
                         learning_rate=0.8,
                         n_estimators=NUM_ESTIMATORS)
# fit the model
clf.fit(train_df[predictors], train_df[target].values)
preds = clf.predict(valid_df[predictors])
# Features importance
tmp = pd.DataFrame({'Feature': predictors, 'Feature importance': clf.feature_importances_})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (7,4))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
plt.show()
# Confusion matrix
cm = pd.crosstab(valid_df[target].values, preds, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm,
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True, ax=ax1,
            linewidths=.2, linecolor="Darkblue", cmap="Blues")
plt.title('Confusion Matrix', fontsize=14)
plt.show()
# Let's calculate also the ROC-AUC.
roc_auc_score(valid_df[target].values, preds)
accuracy_score(valid_df[target].values, preds)
CatBoostClassifier
CatBoostClassifier implements gradient boosting on decision trees, with built-in support for categorical features.
from catboost import CatBoostClassifier
clf = CatBoostClassifier(iterations=500,
                         learning_rate=0.02,
                         depth=12,
                         eval_metric='AUC',
                         random_seed=RANDOM_STATE,
                         bagging_temperature=0.2,
                         od_type='Iter',
                         metric_period=VERBOSE_EVAL,
                         od_wait=100)
# fit the model on the training data
clf.fit(train_df[predictors], train_df[target].values,verbose=True)
preds = clf.predict(valid_df[predictors])
# feature importance
tmp = pd.DataFrame({'Feature': predictors, 'Feature importance': clf.feature_importances_})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (7,4))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
plt.show()
# Confusion matrix
cm = pd.crosstab(valid_df[target].values, preds, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm,
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True, ax=ax1,
            linewidths=.2, linecolor="Darkblue", cmap="Blues")
plt.title('Confusion Matrix', fontsize=14)
plt.show()
# Let's calculate also the ROC-AUC
roc_auc_score(valid_df[target].values, preds)
accuracy_score(valid_df[target].values, preds)
XGBoost
import xgboost as xgb
# Prepare the train and valid datasets
dtrain = xgb.DMatrix(train_df[predictors], train_df[target].values)
dvalid = xgb.DMatrix(valid_df[predictors], valid_df[target].values)
dtest = xgb.DMatrix(test_df[predictors], test_df[target].values)
#What to monitor (in this case, **train** and **valid**)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
# Set xgboost parameters
params = {}
params['objective'] = 'binary:logistic'
params['eta'] = 0.039
params['silent'] = True
params['max_depth'] = 2
params['subsample'] = 0.8
params['colsample_bytree'] = 0.9
params['eval_metric'] = 'auc'
params['random_state'] = RANDOM_STATE
# train the model
model = xgb.train(params,
                  dtrain,
                  MAX_ROUNDS,
                  watchlist,
                  early_stopping_rounds=EARLY_STOP,
                  maximize=True,
                  verbose_eval=VERBOSE_EVAL)
# feature importance
fig, (ax) = plt.subplots(ncols=1, figsize=(8,5))
xgb.plot_importance(model, height=0.8, title="Features importance (XGBoost)", ax=ax, color="green")
plt.show()
preds = model.predict(dtest)
# let's calculate ROC-AUC
roc_auc_score(test_df[target].values, preds)
# accuracy_score(test_df[target].values, preds)
LightGBM
import lightgbm as lgb
from lightgbm import LGBMClassifier
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.05,
    'num_leaves': 7,          # should be smaller than 2^(max_depth)
    'max_depth': 4,           # -1 means no limit
    'min_child_samples': 100, # minimum number of samples in a leaf (min_data_in_leaf)
    'max_bin': 100,           # number of bins used to bucket feature values
    'subsample': 0.9,         # subsample ratio of the training instances
    'subsample_freq': 1,      # frequency of subsampling; <=0 disables it
    'colsample_bytree': 0.7,  # subsample ratio of columns when constructing each tree
    'min_child_weight': 0,    # minimum sum of instance weight (hessian) needed in a leaf
    'min_split_gain': 0,      # minimum gain to perform a split (min_gain_to_split)
    'nthread': 8,
    'verbose': 0,
    'scale_pos_weight': 150,  # because the training data is extremely unbalanced
}
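A common heuristic (an assumption on our part, not what this notebook does) sets `scale_pos_weight` to the negative/positive class ratio; for this dataset that would be far larger than the milder 150 chosen above:

```python
# Class counts in the credit card dataset: 284,315 genuine vs 492 fraudulent
n_genuine, n_fraud = 284315, 492

# Heuristic: scale_pos_weight ~ n_negative / n_positive
print(round(n_genuine / n_fraud))  # ~578; the notebook opts for a milder 150
```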
# split the data
dtrain = lgb.Dataset(train_df[predictors].values,
                     label=train_df[target].values,
                     feature_name=predictors)
dvalid = lgb.Dataset(valid_df[predictors].values,
                     label=valid_df[target].values,
                     feature_name=predictors)
# running the model
evals_results = {}
model = lgb.train(params,
                  dtrain,
                  valid_sets=[dtrain, dvalid],
                  valid_names=['train', 'valid'],
                  evals_result=evals_results,
                  num_boost_round=MAX_ROUNDS,
                  early_stopping_rounds=2*EARLY_STOP,
                  verbose_eval=VERBOSE_EVAL,
                  feval=None)
# feature importance
fig, (ax) = plt.subplots(ncols=1, figsize=(8,5))
lgb.plot_importance(model, height=0.8, title="Features importance (LightGBM)", ax=ax,color="red")
plt.show()
preds = model.predict(test_df[predictors])
# let's calculate ROC-AUC
roc_auc_score(test_df[target].values, preds)
# accuracy_score(test_df[target].values, preds)
Training and validation using cross-validation
Let's now use K-fold cross-validation with 5 folds: the data is divided into 5 folds and, by rotation, we train on 4 folds (k-1) and validate on the remaining fold.
The test set prediction is computed as the average of the per-fold predictions.
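The averaging step can be sketched in isolation (the numbers below are illustrative, not model outputs):

```python
import numpy as np

# Per-fold predicted fraud probabilities for two hypothetical test transactions
fold_preds = np.array([[0.10, 0.90],
                       [0.20, 0.80],
                       [0.15, 0.85],
                       [0.10, 0.90],
                       [0.20, 0.80]])
test_pred = fold_preds.mean(axis=0)  # average over the 5 folds
print(test_pred)
```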
from sklearn.model_selection import KFold
import gc
kf = KFold(n_splits = NUMBER_KFOLDS, random_state = RANDOM_STATE, shuffle = True)
# Create arrays and dataframes to store results
oof_preds = np.zeros(train_df.shape[0])
test_preds = np.zeros(test_df.shape[0])
feature_importance_df = pd.DataFrame()
n_fold = 0
for train_idx, valid_idx in kf.split(train_df):
    train_x, train_y = train_df[predictors].iloc[train_idx], train_df[target].iloc[train_idx]
    valid_x, valid_y = train_df[predictors].iloc[valid_idx], train_df[target].iloc[valid_idx]
    evals_results = {}
    model = LGBMClassifier(
        nthread=-1,
        n_estimators=2000,
        learning_rate=0.01,
        num_leaves=80,
        colsample_bytree=0.98,
        subsample=0.78,
        reg_alpha=0.04,
        reg_lambda=0.073,
        subsample_for_bin=50,
        boosting_type='gbdt',
        is_unbalance=False,
        min_split_gain=0.025,
        min_child_weight=40,
        min_child_samples=510,
        objective='binary',
        metric='auc',
        silent=-1,
        verbose=-1,
        feval=None)
    model.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)],
              eval_metric='auc', verbose=VERBOSE_EVAL, early_stopping_rounds=EARLY_STOP)
    oof_preds[valid_idx] = model.predict_proba(valid_x, num_iteration=model.best_iteration_)[:, 1]
    test_preds += model.predict_proba(test_df[predictors], num_iteration=model.best_iteration_)[:, 1] / kf.n_splits
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = predictors
    fold_importance_df["importance"] = model.feature_importances_  # use this fold's model, not the earlier clf
    fold_importance_df["fold"] = n_fold + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(valid_y, oof_preds[valid_idx])))
    del model, train_x, train_y, valid_x, valid_y
    gc.collect()
    n_fold = n_fold + 1
train_auc_score = roc_auc_score(train_df[target], oof_preds)
print('Full AUC score %.6f' % train_auc_score)
The AUC score for the prediction from the test data was 0.943
pred = test_preds
Isolation Forest, Local Outlier Factor, Support Vector Machine
## Take a sample of the data
data1= df.sample(frac = 0.1,random_state=1)
data1.shape
df.shape
## Get the Fraud and the normal dataset
fraud = df[df['Class']==1]
genuine = df[df['Class']==0]
outlier_fraction = len(fraud)/float(len(genuine))
#Create independent and Dependent Features
columns = data1.columns.tolist()
# Filter the columns to remove data we do not want
columns = [c for c in columns if c not in ["Class"]]
# Store the variable we are predicting
target = "Class"
# Define a random state
state = np.random.RandomState(42)
X = data1[columns]
Y = data1[target]
X_outliers = state.uniform(low=0, high=1, size=(X.shape[0], X.shape[1]))
# Print the shapes of X & Y
print(X.shape)
print(Y.shape)
Model Prediction
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.metrics import accuracy_score
##Define the outlier detection methods
classifiers = {
    "Isolation Forest": IsolationForest(n_estimators=100, max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state, verbose=0),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, algorithm='auto',
                                               leaf_size=30, metric='minkowski',
                                               p=2, metric_params=None,
                                               contamination=outlier_fraction),
    "Support Vector Machine": OneClassSVM(kernel='rbf', degree=3, gamma=0.1, nu=0.05,
                                          max_iter=-1)
}
type(classifiers)
from sklearn.metrics import confusion_matrix
n_outliers = len(fraud)
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # Fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_prediction = clf.negative_outlier_factor_
    elif clf_name == "Support Vector Machine":
        clf.fit(X)
        y_pred = clf.predict(X)
    else:
        clf.fit(X)
        scores_prediction = clf.decision_function(X)
        y_pred = clf.predict(X)
    # Map the predictions to 0 for genuine transactions, 1 for fraudulent ones
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    n_errors = (y_pred != Y).sum()
    # Run classification metrics
    print("{}: {}".format(clf_name, n_errors))
    print("Confusion matrix :")
    print(confusion_matrix(Y, y_pred))
    print("ROC AUC Score :")
    print(roc_auc_score(Y, y_pred))
    print("Accuracy Score :")
    print(accuracy_score(Y, y_pred))
    print(" ")
Conclusion
We investigated the data, checked the class imbalance, visualized the features, and examined the relationships between them. We then built several predictive models. The data was split into three parts: a train set, a validation set, and a test set. For the first three models, we used only the train and validation sets.
We started with RandomForestClassifier, for which we obtained an AUC score of 0.85 when predicting the target for the validation set.
We followed with an AdaBoostClassifier model, which gave a lower AUC score (0.83) when predicting the validation set target values.
We then tried a CatBoostClassifier, which reached an AUC score of 0.86 after 500 training iterations.
We then experimented with an XGBoost model. In this case, we used the validation set during training and evaluated on the test set. The AUC score obtained was 0.98.
We then presented the data to a LightGBM model. We used both train-validation split and cross-validation to evaluate the model effectiveness to predict 'Class' value, i.e. detecting if a transaction was fraudulent. With the first method we obtained values of AUC for the validation set around 0.978. For the test set, the score obtained was 0.947. With the cross-validation, we obtained an AUC score for the test prediction of 0.943.
The last three models used here show that accuracy is not always the best metric: they produced the three lowest ROC-AUC scores, with Isolation Forest at 0.63, Local Outlier Factor at 0.51, and SVM at 0.53.